Introduction to R Programming

STA6235: Modeling in Regression

Introduction

  • Recall the general linear model, y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k + \varepsilon

  • This is a multiple regression model because it has multiple predictors (x_i).

    • A special case is simple linear regression, when there is a single predictor.
  • \beta_0 is the y-intercept, or the average outcome (y) when all x_i = 0.

  • \beta_i is the slope for predictor i and describes the relationship between the predictor and the outcome, after adjusting (or accounting) for the other predictors in the model.

Data for Today’s Lecture

library(tidyverse)
office_ratings <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-17/office_ratings.csv')
head(office_ratings, n=4)
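  • Before modeling, it can help to confirm that the variables we will use are present and complete; a quick sketch (assuming the download above succeeded):

```r
library(tidyverse)

# Quick structural check: column types, plus missingness in the modeling variables
glimpse(office_ratings)
office_ratings %>%
  summarize(across(c(season, episode, imdb_rating), ~ sum(is.na(.x))))
```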

Constructing the Model in R

  • We will use the lm() function to construct the linear model,
m <- lm([outcome] ~ [pred1] + [pred2] + [pred3] + ..., 
        data = [dataset])
  • We can also use the glm() function,
m <- glm([outcome] ~ [pred1] + [pred2] + [pred3] + ..., 
         data = [dataset],
         family = "gaussian")
  • Then we run the model results through the summary() function to obtain information about the model,
summary(m)

Constructing the Model in R

  • Let’s model the IMDB rating as a function of season and episode number.

    • First, let’s use lm():
m1 <- lm(imdb_rating ~ season + episode, 
         data = office_ratings)
summary(m1)

Call:
lm(formula = imdb_rating ~ season + episode, data = office_ratings)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.43672 -0.29274 -0.03135  0.26734  1.62060 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.605835   0.102452  83.999  < 2e-16 ***
season      -0.093289   0.015109  -6.174 4.11e-09 ***
episode      0.013616   0.005132   2.653  0.00868 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4888 on 185 degrees of freedom
Multiple R-squared:  0.1835,    Adjusted R-squared:  0.1747 
F-statistic: 20.79 on 2 and 185 DF,  p-value: 7.147e-09

Constructing the Model in R

  • Let’s model the IMDB rating as a function of season and episode number.

    • Now, let’s use glm():
m2 <- glm(imdb_rating ~ season + episode, 
         data = office_ratings,
         family = "gaussian")
summary(m2)

Call:
glm(formula = imdb_rating ~ season + episode, family = "gaussian", 
    data = office_ratings)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.605835   0.102452  83.999  < 2e-16 ***
season      -0.093289   0.015109  -6.174 4.11e-09 ***
episode      0.013616   0.005132   2.653  0.00868 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.238935)

    Null deviance: 54.140  on 187  degrees of freedom
Residual deviance: 44.203  on 185  degrees of freedom
AIC: 269.36

Number of Fisher Scoring iterations: 2

Stating the Model

  • Using the stated coefficients, we can state the model.
coefficients(m1)
(Intercept)      season     episode 
 8.60583499 -0.09328865  0.01361565 
  • Thus, the resulting model is \hat{y} = 8.61 - 0.09 x_1 + 0.01 x_2, where

    • y is the IMDB rating,

    • x_1 is the season number, and

    • x_2 is the episode number.
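  • As a sanity check, we can evaluate the stated model by hand; for example, for season 5, episode 1 (values chosen purely for illustration):

```r
c1 <- coefficients(m1)

# yhat = b0 + b_season * 5 + b_episode * 1
unname(c1["(Intercept)"] + c1["season"] * 5 + c1["episode"] * 1)
```

    • This should match predict(m1, newdata = tibble(season = 5, episode = 1)).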

Stating the Model

  • Using the stated coefficients, we can state the model.
coefficients(m2)
(Intercept)      season     episode 
 8.60583499 -0.09328865  0.01361565 
  • Instead of using y and x_i, we can state it this way: \hat{\text{rating}} = 8.61 - 0.09 \text{ season} + 0.01 \text{ episode}

  • I prefer using this method when explaining models to collaborators.

  • For homework/projects, I do not care which method you use.

    • If you use y and x_i, you must define them in context of the dataset.
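  • Because lm() and glm() with family = "gaussian" fit the same model (least squares and maximum likelihood coincide here), their coefficients agree to numerical precision; a quick sketch to verify:

```r
# m1 (from lm) and m2 (from gaussian glm) should agree
all.equal(coefficients(m1), coefficients(m2))
```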

Interpretation of Slope

  • We want to put the slope into perspective for whoever we are collaborating with.

  • Basic interpretation: for every 1 [units of x_i] increase in [x_i], [y] [increases or decreases] by \left[ \left| \hat{\beta}_i \right| \right] [units of y].

    • We say that y is decreasing if \hat{\beta}_i < 0 and y is increasing if \hat{\beta}_i > 0.
  • We can also scale our interpretations. e.g.,

    • For every 7 [units of x_i] increase in [x_i], [y] [increases or decreases] by \left[ 7 \times \left| \hat{\beta}_i \right| \right] [units of y].
  • Note that in the case of multiple regression, there is an unspoken “after adjusting for everything else in the model” at the end of the sentence.

    • Always remember that we are controlling for all other predictors included in the predictor set.

Interpretation of Slope

  • Let’s interpret the slopes for the model we constructed.
coefficients(m1)
(Intercept)      season     episode 
 8.60583499 -0.09328865  0.01361565 
  • For a 1 season increase in season number, we expect IMDB ratings to decrease by 0.09 points.

    • For a 5 season increase in season number, we expect IMDB ratings to decrease by 0.47 points.
  • For a 1 episode increase in episode number, we expect IMDB ratings to increase by 0.01 points.

    • For a 10 episode increase in episode number, we expect IMDB ratings to increase by 0.14 points.
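  • Rather than multiplying by hand, we can scale the slopes directly in R (assuming m1 from earlier); this reproduces the 5-season and 10-episode interpretations above:

```r
b <- coefficients(m1)

5  * b["season"]   # expected change over 5 seasons; about -0.47
10 * b["episode"]  # expected change over 10 episodes; about 0.14
```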

Interpretation of Intercept

  • The intercept is the average of the outcome when all predictors in the model are equal to 0.

  • In our example,

coefficients(m1)
(Intercept)      season     episode 
 8.60583499 -0.09328865  0.01361565 
  • The average IMDB rating for season 0, episode 0 is 8.61.

  • Is this reasonable?
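  • One way to answer is to check whether season 0, episode 0 is anywhere near the observed data; if not, the intercept is an extrapolation. A quick sketch:

```r
# The Office has no season 0 or episode 0, so the intercept is an extrapolation
range(office_ratings$season)
range(office_ratings$episode)
```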

Confidence Intervals for \beta_i

  • Recall that confidence intervals allow us to determine how "good" our estimation is.

  • In general, CIs will take the form

    point estimate \pm margin of error

    • The margin of error is a critical value (e.g., t_{1-\alpha/2}) multiplied by the standard error of the point estimate.
  • Recall that the standard error accounts for the sample size.

  • In R, we will run the model results through the confint() function.

confint(m)

Confidence Intervals for \beta_i

  • In our example,
confint(m1)
                   2.5 %      97.5 %
(Intercept)  8.403710417  8.80795957
season      -0.123097499 -0.06347981
episode      0.003490008  0.02374130
  • We have the following CIs:

    • 95% CI for \beta_{\text{season}} is (-0.12, -0.06)
    • 95% CI for \beta_{\text{episode}} is (0.003, 0.024)
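  • We can reproduce the confint() output from the summary table using point estimate \pm t_{1-\alpha/2} \times SE; a sketch for the season slope:

```r
est <- summary(m1)$coefficients
b   <- est["season", "Estimate"]
se  <- est["season", "Std. Error"]

# 97.5th percentile of t with the model's residual df (185 here)
tcrit <- qt(0.975, df = df.residual(m1))
b + c(-1, 1) * tcrit * se  # should match confint(m1)["season", ]
```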

Confidence Intervals for \beta_i

  • We can change the confidence level by specifying the level.
confint(m1, level=0.99)
                    0.5 %      99.5 %
(Intercept)  8.3391865895  8.87248340
season      -0.1326133179 -0.05396399
episode      0.0002576176  0.02697369
confint(m1, level=0.8914)
                  5.43 %     94.57 %
(Intercept)  8.440650938  8.77101905
season      -0.117649600 -0.06892770
episode      0.005340582  0.02189073

Significant Regression Line

  • Hypotheses

    • H_0: \ \beta_1 = ... = \beta_k = 0
    • H_1: at least one \beta_i \ne 0
  • Test Statistic and p-Value

    • F_0 and p; depends on lm() or glm().
  • Rejection Region

    • Reject H_0 if p < \alpha
  • Conclusion/Interpretation

    • [Reject (if p < \alpha) or fail to reject (if p \ge \alpha)] H_0. There [is (if reject) or is not (if FTR)] sufficient evidence to suggest that at least one slope is non-zero.

Significant Regression Line

  • Test Statistic and p-Value

    • F_0 and p; depends on lm() or glm(). 🧐
  • If lm():

m <- lm(y ~ pred1 + pred2 + ..., data = dataset)
summary(m)
  • If glm():
full <- glm(y ~ pred1 + pred2 + ..., data = dataset, family = "gaussian")
reduced <- glm(y ~ 1, data = dataset, family = "gaussian")
anova(reduced, full, test = "F")

Significant Regression Line

  • Our example under lm()
m1 <- lm(imdb_rating ~ season + episode, data = office_ratings)
summary(m1)

Call:
lm(formula = imdb_rating ~ season + episode, data = office_ratings)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.43672 -0.29274 -0.03135  0.26734  1.62060 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.605835   0.102452  83.999  < 2e-16 ***
season      -0.093289   0.015109  -6.174 4.11e-09 ***
episode      0.013616   0.005132   2.653  0.00868 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4888 on 185 degrees of freedom
Multiple R-squared:  0.1835,    Adjusted R-squared:  0.1747 
F-statistic: 20.79 on 2 and 185 DF,  p-value: 7.147e-09

Significant Regression Line

  • Our example under glm():
full <- glm(imdb_rating ~ season + episode, data = office_ratings, family = "gaussian")
reduced <- glm(imdb_rating ~ 1, data = office_ratings, family = "gaussian")
anova(reduced, full, test = "F")

Significant Regression Line

  • Hypotheses

    • H_0: \ \beta_1 = \beta_2 = 0
    • H_1: at least one \beta_i \ne 0
  • Test Statistic and p-Value

    • F_0 = 20.794; p < 0.001.
  • Rejection Region

    • Reject H_0 if p < \alpha; \alpha = 0.05
  • Conclusion/Interpretation

    • Reject H_0. There is sufficient evidence to suggest that at least one slope is non-zero.
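  • The reported F statistic can be recovered by hand from the glm deviances: F_0 = [(null deviance - residual deviance)/k] \div [residual deviance/(n - k - 1)]. Using the values printed above:

```r
# ((54.140 - 44.203) / 2) / (44.203 / 185)
F0 <- ((54.140 - 44.203) / 2) / (44.203 / 185)
F0  # about 20.79, matching the F statistic from summary(m1)
```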

Significant Predictors of y

  • Hypotheses

    • H_0: \ \beta_i = 0
    • H_1: \ \beta_i \ne 0
  • Test Statistic and p-Value

    • t_0 and p from summary()
  • Rejection Region

    • Reject H_0 if p < \alpha
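  • The t statistics and p-values used in this test come from the coefficient table, which we can pull out programmatically instead of reading off the printout (assuming m1 from earlier):

```r
coef_table <- summary(m1)$coefficients

# One row per term (including the intercept)
coef_table[, c("t value", "Pr(>|t|)")]
```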

Significant Predictors of y

  • Let’s now determine which, if any, are significant predictors of IMDB ratings.
summary(m1)

Call:
lm(formula = imdb_rating ~ season + episode, data = office_ratings)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.43672 -0.29274 -0.03135  0.26734  1.62060 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.605835   0.102452  83.999  < 2e-16 ***
season      -0.093289   0.015109  -6.174 4.11e-09 ***
episode      0.013616   0.005132   2.653  0.00868 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4888 on 185 degrees of freedom
Multiple R-squared:  0.1835,    Adjusted R-squared:  0.1747 
F-statistic: 20.79 on 2 and 185 DF,  p-value: 7.147e-09

Significant Predictors of y

  • Hypotheses

    • H_0: \ \beta_{\text{season}} = 0
    • H_1: \ \beta_{\text{season}} \ne 0
  • Test Statistic and p-Value

    • t_0 = -6.174

    • p < 0.001

  • Rejection Region

    • Reject H_0 if p < \alpha; \alpha=0.05
  • Conclusion/Interpretation

    • Reject H_0. There is sufficient evidence to suggest that season significantly predicts IMDB rating.

Significant Predictors of y

  • Hypotheses

    • H_0: \ \beta_{\text{episode}} = 0
    • H_1: \ \beta_{\text{episode}} \ne 0
  • Test Statistic and p-Value

    • t_0 = 2.653

    • p = 0.009

  • Rejection Region

    • Reject H_0 if p < \alpha; \alpha=0.05
  • Conclusion/Interpretation

    • Reject H_0. There is sufficient evidence to suggest that episode number (of the season) significantly predicts IMDB rating.

Predicted Values

  • Once we have a regression model, we can construct the predicted value (\hat{y}). Recall, \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + ... + \hat{\beta}_k x_k

  • If we plug in the observed predictor values for an observation in the dataset, we can simplify and find that observation’s predicted value.

    • i.e., what did we expect the outcome to be given the characteristics of that observation?
  • In R, we will use the predict() function to add the predicted value to our dataset.

dataset <- dataset %>%
  mutate(y_hat = predict(m))
  • Note that we can use the predict() function regardless of using lm() or glm() to construct the model.

Predicted Values

  • In our Office example,
office_ratings <- office_ratings %>%
  mutate(y_hat_m1 = predict(m1),
         y_hat_m2 = predict(m2))
head(office_ratings, n=5)
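  • predict() also accepts a newdata argument for observations not in the dataset; for example, a hypothetical season 10 premiere (an extrapolation, since the show ended after season 9):

```r
# Predicted IMDB rating for a hypothetical season 10, episode 1
predict(m1, newdata = tibble(season = 10, episode = 1))
```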

Visualizing the Results

  • It often helps for us to visualize the model we have constructed.

  • Recall that while we often refer to it as a “regression line,” it is actually a “response surface.”

    • Visualization gets increasingly difficult as we add predictors to our model.
  • We must hold all but one variable (that will be on the x-axis) constant and create a predicted value.

    • This means that we will create a version of the predicted value, with all other predictors held at fixed values.

    • Unfortunately, there is not a built-in function to do this for us.

  • In theory you could hard code everything, however, I prefer to save the output of the coefficients() function and call individual pieces (see example).

c <- coefficients(m)

Visualizing the Results

  • Let’s create a graph for our Office model.
(c1 <- coefficients(m1))
(Intercept)      season     episode 
 8.60583499 -0.09328865  0.01361565 
  • I personally think season is more interesting, so I will let that one vary.

  • This means that we need to plug in value(s) for episode.

    • I will create lines for episodes 1, 10, and 20.
office_ratings <- office_ratings %>%
  mutate(y_hat_ep1 = c1["(Intercept)"] + c1["season"]*season + c1["episode"]*1,
         y_hat_ep10 = c1["(Intercept)"] + c1["season"]*season + c1["episode"]*10,
         y_hat_ep20 = c1["(Intercept)"] + c1["season"]*season + c1["episode"]*20)
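  • As an alternative to hard-coding the linear combination, we can build a small grid of predictor values and let predict() do the arithmetic; a sketch that reproduces the same three lines (assuming seasons run 1 through 9):

```r
# One row per (season, episode) combination we want to plot
plot_data <- expand_grid(season = 1:9, episode = c(1, 10, 20))
plot_data$y_hat <- predict(m1, newdata = plot_data)
```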

Visualizing the Results

  • Let’s check the dataset.
head(office_ratings)

Visualizing the Results

  • Let’s start building the graph.
office_ratings %>% 
  ggplot(aes(x = season)) +
  geom_point(aes(y = imdb_rating)) +
  geom_line(aes(y = y_hat_ep1), color = "blue") +
  geom_line(aes(y = y_hat_ep10), color = "purple") +
  geom_line(aes(y = y_hat_ep20), color = "darkgreen") +
  theme_bw()

office_ratings %>% 
  ggplot(aes(x = season)) +
  geom_point(aes(y = imdb_rating)) +
  geom_line(aes(y = y_hat_ep1), color = "blue") +
  geom_line(aes(y = y_hat_ep10), color = "purple") +
  geom_line(aes(y = y_hat_ep20), color = "darkgreen") +
  geom_text(aes(x = 9.75, y = 8.05, label = "Episode 20     ")) +
  geom_text(aes(x = 9.75, y = 7.90, label = "Episode 10     ")) +
  geom_text(aes(x = 9.75, y = 7.75, label = "Episode  1     ")) +
  scale_x_continuous(breaks = office_ratings$season) +
  labs(x = "Season", y = "IMDB Rating") +
  ylim(6, 10) +
  theme_bw()

office_ratings %>% 
  mutate(finale = if_else((season == 1 & episode == 6) | 
                          (season == 2 & episode == 22) | 
                          (season == 3 & episode == 23) | 
                          (season == 4 & episode == 14) |
                          (season == 5 & episode == 26) |
                          (season == 6 & episode == 26) |
                          (season == 7 & episode == 24) |
                          (season == 8 & episode == 24) |
                          (season == 9 & episode == 23), "Yes", "No")) %>%
  ggplot(aes(x = season)) +
  geom_point(aes(y = imdb_rating, color = as.factor(finale))) +
  geom_line(aes(y = y_hat_ep1)) +
  geom_line(aes(y = y_hat_ep10)) +
  geom_line(aes(y = y_hat_ep20)) +
  geom_text(aes(x = 9.75, y = 8.05, label = "Episode 20     ")) +
  geom_text(aes(x = 9.75, y = 7.90, label = "Episode 10     ")) +
  geom_text(aes(x = 9.75, y = 7.75, label = "Episode  1     ")) +
  scale_x_continuous(breaks = office_ratings$season) +
  labs(x = "Season", y = "IMDB Rating", color = "Finale") +
  ylim(6, 10) +
  theme_bw()

Visualizing the Results

  • Even with a simple example, graphing is still complicated.

  • In the example:

    • IMDB ratings must go on the y-axis as it is our outcome.

    • We selected season to be on the x-axis, however, we could have put episode number on the x-axis.

    • We graphed a few different episode numbers - we could limit it to one or we could expand it to include more.

  • You will see that graphing regression results gets complicated quickly, simply due to the number of predictors in the model.

  • We can also do fancier things with graphs, like include confidence bands or overlay loess lines.

    • In class, I demonstrate the base “example” graphs I provide collaborators as a tool for discussion.

    • These are just starting points and I will edit graphs based on feedback and requests from collaborators.

Wrap Up

  • Today we reminded ourselves about multiple regression. We covered how to:

    • construct the model in R,

    • interpret the intercept and slope,

    • find the confidence interval for \beta_i,

    • determine if the regression line is significant,

    • determine if individual predictors are significant,

    • create predicted values, and

    • visualize regression results.

  • Next week, we will:

    • discuss assumptions and diagnostics of the linear model (Tuesday) and

    • introduce categorical predictors (Thursday).